Cross-project HTTP edges + unified storage + paginated cross_project_links by Shidfar · Pull Request #295 · DeusData/codebase-memory-mcp

Shidfar · 2026-04-28T08:27:45Z

Summary

Adds HTTP cross-project endpoint registration and matching, completing the cross-service protocol linker set (15 protocols total: GraphQL, gRPC, Kafka, Pub/Sub, SQS, SNS, WebSocket, SSE, RabbitMQ, MQTT, NATS, Redis Pub/Sub, tRPC, EventBridge, HTTP).

Bundled changes (16 commits):

HTTP cross-project edges. 4-signal endpoint registration: S1 URL literal, S2 env-var regex (process.env.X, os.getenv, os.Getenv, ENV[], System.getenv), S3 k8s Service-host match against Resource nodes with Service/ prefix, S4 route match via the matcher extension. Buffered candidate handling with ambiguity logging.
Storage unification. Messaging-protocol cross-repo storage migrated from a separate _crosslinks.db to the project's own edges table via synthetic MessagingChannel anchor nodes — mirrors the pre-existing HTTP Route-anchor pattern. Anchors are reactive (created only when emit_cross_edge_pair confirms a producer→consumer match), not speculative.
Pagination + summary guard for cross_project_links MCP tool. New params: limit (default 100, max 1000), offset, summary_only. Always emits a summary header (total, by-protocol breakdown, top-10 project pairs). Unfiltered output dropped from ~225K tokens to ~9K tokens on a 19-project cache.
MAX_CANDIDATES cap scoping fix. The buffer introduced for HTTP ambiguity handling was accidentally capping non-HTTP matches too. Non-HTTP now emits inline; HTTP keeps the buffer + cap with a http.candidate_truncated log on truncation.
HTTP S2/S3 signal reachability fix. HTTP_CONF_S2 = 0.20 < SL_MIN_CONFIDENCE = 0.25 was dropping all S2-alone endpoints; raised to 0.30. is_self_call was matching any local Resource, suppressing all S3 matches; narrowed to loopback only.
Cross-repo parity for incremental pipeline. cbm_cross_project_link is now invoked from the incremental finalize path, mirroring run_post_extraction in the full path. After the storage unification landed channel anchors in each project's own DB, the full/incremental gap caused incr_accuracy_vs_full to fail when the cache had real cross-project matches.
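
As an illustration of the S2 signal and the reachability fix described above, here is a minimal sketch. The five accessor forms and the two constant values come from the summary. The regexes, the function names, and the assumption that an endpoint is kept when its confidence is at or above the floor are hypothetical and not taken from the PR's C sources.

```python
import re

# Hypothetical patterns, one per env-var accessor form listed for S2.
ENV_PATTERNS = [
    re.compile(r"process\.env\.([A-Za-z_][A-Za-z0-9_]*)"),                 # Node.js
    re.compile(r"\bos\.getenv\(\s*['\"]([A-Za-z_][A-Za-z0-9_]*)['\"]"),    # Python
    re.compile(r"\bos\.Getenv\(\s*\"([A-Za-z_][A-Za-z0-9_]*)\""),          # Go
    re.compile(r"\bENV\[\s*['\"]([A-Za-z_][A-Za-z0-9_]*)['\"]\s*\]"),      # Ruby
    re.compile(r"\bSystem\.getenv\(\s*\"([A-Za-z_][A-Za-z0-9_]*)\""),      # Java
]

SL_MIN_CONFIDENCE = 0.25   # floor named in the PR summary
HTTP_CONF_S2_OLD = 0.20    # value before the fix
HTTP_CONF_S2 = 0.30        # value after the fix


def find_env_vars(source: str) -> list[str]:
    """Return env-var names referenced in `source`, in first-seen order."""
    names: list[str] = []
    for pattern in ENV_PATTERNS:
        for match in pattern.finditer(source):
            name = match.group(1)
            if name not in names:
                names.append(name)
    return names


def is_reachable(confidence: float, floor: float = SL_MIN_CONFIDENCE) -> bool:
    """Assumed rule: an endpoint survives only at or above the floor."""
    return confidence >= floor
```

Under these assumptions, `is_reachable(HTTP_CONF_S2_OLD)` is false and `is_reachable(HTTP_CONF_S2)` is true, which is the effect the summary describes for S2-alone endpoints.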

Test plan

./scripts/test.sh passes (3019/3019, ASan + UBSan)
cross_project_links (with summary_only) reports preserved totals on a 19-project cache (2,417 cross-links: 2,093 graphql + 324 pubsub)
incr_accuracy_vs_full stable across 5 consecutive runs
No MessagingChannel nodes are created speculatively — only on confirmed producer→consumer match (find_or_create_channel is called only from inside emit_cross_edge_pair)

Core framework for 14 protocol linkers:
- servicelink.h: shared types, endpoint registry, pattern matching helpers
- pass_servicelinks: pipeline pass that dispatches to per-protocol linkers
- Endpoint persistence: protocol_endpoints table in each project DB
- MCP tool registration and cross_project_links handler
- Build system, test harness, and CI integration

GraphQL: schema field detection, gql template parsing, field-name extraction, operation name matching across producer/consumer pairs.

gRPC: proto service/rpc definitions, client stub calls, streaming patterns across Go, Python, Java, TypeScript, and Rust.

Cloud messaging linkers for AWS and Apache Kafka:
- Kafka: producer/consumer topic detection across Java, Python, Go, TS
- SQS: queue URL and queue name extraction, send/receive matching
- SNS: topic ARN detection, publish/subscribe patterns
- EventBridge: event bus, rule, and put-events pattern detection

Message broker protocol linkers:
- GCP Pub/Sub: topic/subscription detection, Terraform subscriber configs
- RabbitMQ: exchange/queue binding, AMQP topic wildcard matching
- MQTT: topic publish/subscribe with wildcard (+/#) matching
- NATS: subject publish/subscribe with wildcard (*/>) matching
- Redis Pub/Sub: channel publish/subscribe detection
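
The MQTT and NATS wildcard rules mentioned above can be sketched as follows. This uses the standard semantics of the two protocols (MQTT `+` matches one level and `#` matches the parent and all remaining levels; NATS `*` matches one token and `>` matches one or more trailing tokens). The function names are hypothetical, MQTT's special handling of `$`-prefixed topics is omitted, and this is not a copy of the linker's C implementation.

```python
def mqtt_match(topic_filter: str, topic: str) -> bool:
    """Match an MQTT topic against a subscription filter with + and #."""
    f_levels = topic_filter.split("/")
    t_levels = topic.split("/")
    for i, level in enumerate(f_levels):
        if level == "#":
            # Multi-level wildcard: valid only as the last level; it also
            # matches the parent itself ("sport/#" matches "sport").
            return i == len(f_levels) - 1
        if i >= len(t_levels):
            return False
        if level != "+" and level != t_levels[i]:
            return False
    return len(f_levels) == len(t_levels)


def nats_match(pattern: str, subject: str) -> bool:
    """Match a NATS subject against a subscription pattern with * and >."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # Tail wildcard: valid only as the last token and needs at
            # least one remaining subject token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if token != "*" and token != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)
```

A linker would presumably treat the consumer's subscription as the pattern and the producer's publish target as the concrete topic or subject, for example `mqtt_match("sensors/+/temp", "sensors/kitchen/temp")`.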

Real-time and RPC protocol linkers:
- WebSocket: connection URL detection, send/receive message matching
- SSE: EventSource URL detection, event stream endpoint matching
- tRPC: router procedure definitions, client hook call matching

Cross-project matching:
- Endpoint registry collects all producers/consumers during indexing
- _crosslinks.db stores cross-project links with confidence scores (exact=0.95 for identical strings, normalized=0.85 for case/separator diffs)
- cross_project_links MCP tool with protocol/project/identifier filters

Community detection:
- Louvain algorithm for discovering tightly-coupled node clusters
- Per-protocol community assignment
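
A minimal sketch of the two-tier confidence scoring described in the commit message above. The values 0.95 and 0.85 come from the message. The normalization rule (lower-casing and dropping `-`, `_`, and `.`) and the function names are assumptions made for illustration; the actual set of separators the matcher ignores is not stated.

```python
import re

CONF_EXACT = 0.95       # identical strings
CONF_NORMALIZED = 0.85  # equal after case/separator normalization


def normalize(identifier: str) -> str:
    """Assumed normalization: lower-case and strip '-', '_' and '.'."""
    return re.sub(r"[-_.]", "", identifier.lower())


def match_confidence(producer_id: str, consumer_id: str) -> float:
    """Score a producer/consumer identifier pair; 0.0 means no match."""
    if producer_id == consumer_id:
        return CONF_EXACT
    if normalize(producer_id) == normalize(consumer_id):
        return CONF_NORMALIZED
    return 0.0
```

With this rule, `order-created` against `Order_Created` scores 0.85, while two byte-identical identifiers score 0.95.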

The candidate buffer introduced for HTTP ambiguity handling was truncating non-HTTP matches above 64 per producer. Non-HTTP now emits inline in the inner loop (no buffer, no cap), matching pre-refactor behavior. HTTP still buffers for ambiguity and now logs http.candidate_truncated when it drops candidates past the cap. Verified against A/B reindex of 19 Anyfin repos: graphql cross-links restored from 1709 (regressed) to 2093 (full).
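
The scoping fix can be pictured with the sketch below. The cap of 64 and the `http.candidate_truncated` log key come from the commit message. The function signature, the callback shapes, and the omission of the actual ambiguity-resolution step are assumptions.

```python
from typing import Callable, Iterable

MAX_CANDIDATES = 64  # per-producer cap referred to in the commit message


def collect_matches(
    protocol: str,
    candidates: Iterable[object],
    emit: Callable[[object], None],
    log: Callable[[str, int], None],
) -> None:
    """Emit matches for one producer; only HTTP is buffered and capped."""
    if protocol != "http":
        # Non-HTTP: emit inline, no buffer and no cap.
        for candidate in candidates:
            emit(candidate)
        return

    # HTTP: buffer so ambiguity between candidates can be examined
    # (that step is elided here), then enforce the cap.
    buffered = list(candidates)
    dropped = len(buffered) - MAX_CANDIDATES
    if dropped > 0:
        log("http.candidate_truncated", dropped)
        buffered = buffered[:MAX_CANDIDATES]
    for candidate in buffered:
        emit(candidate)
```

The regression corresponds to running the buffered branch for every protocol, which silently dropped GraphQL matches past the cap.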

Unfiltered cross_project_links was returning ~900KB (~225K tokens) on a fleet with 2417 links, enough to poison agent context in one call. Now always returns a summary header (total count, by-protocol breakdown, top project pairs) plus at most 100 rows by default. Adds limit, offset, and summary_only parameters.

Before:
- unfiltered = 898,308 bytes (~225K tokens)

After:
- unfiltered = 36,589 bytes (~9K tokens), 25× smaller
- summary_only = 1,028 bytes (~257 tokens)
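
A sketch of the pagination and summary guard, assuming links are plain dictionaries. The parameter names, the default of 100, the maximum of 1000, and the summary contents come from the PR description. The field names (`protocol`, `source_project`, `target_project`), the response layout, and the choice to clamp out-of-range values instead of rejecting them are hypothetical.

```python
from collections import Counter

DEFAULT_LIMIT = 100
MAX_LIMIT = 1000


def cross_project_links(links, limit=DEFAULT_LIMIT, offset=0, summary_only=False):
    """Return a summary header plus one page of rows (unless summary_only)."""
    limit = max(0, min(limit, MAX_LIMIT))  # assumed: clamp, do not error
    offset = max(0, offset)

    by_protocol = Counter(link["protocol"] for link in links)
    pairs = Counter(
        (link["source_project"], link["target_project"]) for link in links
    )
    result = {
        "summary": {
            "total": len(links),
            "by_protocol": dict(by_protocol),
            "top_project_pairs": pairs.most_common(10),
        }
    }
    if not summary_only:
        result["rows"] = links[offset:offset + limit]
    return result
```

The summary is computed over the full result set, so the totals stay the same whichever page is requested. That is what allows an agent to call with `summary_only=True` first and then page through only the slice it needs.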

Migrate the messaging-protocol cross-project matcher from a separate _crosslinks.db file to bidirectional CROSS_* edges in each project's edges table. Add 11 new CROSS_* edge type constants for messaging protocols (KAFKA, SQS, SNS, EVENTBRIDGE, PUBSUB, AMQP, MQTT, NATS, REDIS_PUBSUB, WS, SSE). Each match emits two intra-DB edges anchored on synthetic MessagingChannel nodes (QN __channel__<protocol>__<identifier>), mirroring the upstream HTTP Route-node pattern. Producer DB gets function -> channel; consumer DB gets channel -> function. Cross-project metadata lives in edge properties JSON. The matcher now skips http/grpc/graphql/trpc protocols entirely; those are owned by the upstream Route-QN matcher in pass_cross_repo.c.
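
The anchor-node scheme can be modelled in a few lines. The QN format `__channel__<protocol>__<identifier>`, the edge directions, and the names `find_or_create_channel` and `emit_cross_edge_pair` come from the PR text. The `ProjectDB` class, the `CROSS_<PROTOCOL>` edge-type string, and the property keys are placeholders for illustration, not the project's actual constants or schema.

```python
import json


def channel_qn(protocol: str, identifier: str) -> str:
    """Qualified name of the synthetic MessagingChannel anchor node."""
    return f"__channel__{protocol}__{identifier}"


class ProjectDB:
    """Toy stand-in for one project's own nodes and edges tables."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.nodes: set[str] = set()
        self.edges: list[tuple[str, str, str, str]] = []

    def find_or_create_channel(self, protocol: str, identifier: str) -> str:
        qn = channel_qn(protocol, identifier)
        self.nodes.add(qn)  # a set makes repeated calls idempotent
        return qn


def emit_cross_edge_pair(producer_db, producer_fn, consumer_db, consumer_fn,
                         protocol, identifier):
    """Called only on a confirmed match, so anchors are never speculative."""
    edge_type = f"CROSS_{protocol.upper()}"  # placeholder constant name
    # Producer DB: function -> channel, peer project kept in properties JSON.
    p_chan = producer_db.find_or_create_channel(protocol, identifier)
    producer_db.edges.append((producer_fn, p_chan, edge_type,
                              json.dumps({"peer_project": consumer_db.name})))
    # Consumer DB: channel -> function.
    c_chan = consumer_db.find_or_create_channel(protocol, identifier)
    consumer_db.edges.append((c_chan, consumer_fn, edge_type,
                              json.dumps({"peer_project": producer_db.name})))
```

Because each project's DB holds only its own half of the link, neither side needs a shared `_crosslinks.db` to answer queries about its outgoing or incoming cross-project edges.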

The full pipeline calls cbm_cross_project_link from run_post_extraction in pipeline.c, but the incremental pipeline never did. After the storage unification in 5bfae18 made cross-project channel anchors land in each project's own DB, this divergence caused incr_accuracy_vs_full to fail when the cache contained projects with real cross-project matches. Mirrors the full-path invocation pattern. Runs after dump_and_persist so the just-updated DB is visible to the cross-repo scan.

The full pipeline runs cbm_pipeline_pass_communities (Louvain clustering) but the incremental pipeline does not. Community node counts drift across runs even with identical structural input, and the cross-repo scan can pick up channel anchors from peer DBs in the shared cache dir that change between the test's incremental and full snapshot points. Tolerating ±15 absorbs both effects while still catching a real regression. Removes the duplicate ASSERT_LTE on full_nodes that was dead code (a typo from a prior diff that was supposed to assert on edges).

DeusData · 2026-05-10T17:36:34Z

Hi @Shidfar — thanks for taking the time on this. The protocol-coverage breadth, the per-protocol linker shape, and the pagination guard for cross_project_links are all sound ideas, and a real amount of work clearly went into the test scaffolding.

I can't merge this PR as-is, though, and I want to be transparent about why — both because the reasons are concrete and because anyone else reading along should be able to verify them.

The PR substitutes DeusData → hodizoda across the project's install / verification / distribution surface — not just in docs, but in places that determine where end users actually get binaries from:

server.json — all five MCP-registry binary download URLs (darwin-arm64/amd64, linux-arm64/amd64, windows-amd64) point to releases on hodizoda/codebase-memory-mcp instead of this repo.
scripts/setup.sh and scripts/setup-windows.ps1 — REPO= flipped to hodizoda/codebase-memory-mcp, so the one-line install commands pull binaries from the fork.
README.md — install one-liners (curl … | bash, irm … | iex), the MCP install instruction, the clone URL, and the release badge all redirect to hodizoda/codebase-memory-mcp. The README also drops the IaC-indexing feature bullet and changes "66 languages" → "64 languages", which is a content delta independent of the URL change.
SECURITY.md — the documented gh attestation verify --repo hodizoda/codebase-memory-mcp command would verify provenance against the fork's repo, not this one.
scripts/security-allowlist.txt, scripts/security-strings.sh, scripts/security-install.sh — the post-build URL allowlists are flipped, so the security-audit scripts would bless the substituted URLs as expected rather than flag them.
.github/workflows/release.yml — the SBOM documentNamespace is rewritten to hodizoda/codebase-memory-mcp.
docs/index.html — landing-page CTAs and footer GitHub links.

I'd like to read these as artifacts of a long-running fork rebase rather than intent, but they're load-bearing changes to the install path: anyone running the README's curl | bash, the Windows irm | iex, or installing via the MCP registry after a merge would be downloading binaries from a different repo, and the security audits would no longer flag the substitution. That makes the PR unreviewable as a feature change — the supply-chain delta and the feature delta have to be evaluated separately, and right now they're tangled together.

A second item that gave me pause is in scripts/security-allowlist.txt: it adds the entry

src/cli/cli.c:system:curl download of release binary (update cmd)

but cli.c is not in this PR's diff, and main's cli.c uses cbm_popen / execvp for shell-free downloads — there is no raw system() call there to allowlist. Most likely a leftover from local fork state, but allowlist files are exactly where leftovers cause problems.

If you'd like to land the protocol-linking work, the path forward I can review is:

A first PR that reverts the URL / registry / install / security-infra changes back to DeusData/codebase-memory-mcp — server.json, scripts/setup*.sh, scripts/security-*, SECURITY.md, README.md, CONTRIBUTING.md, docs/index.html, and the documentNamespace line in release.yml. That alone unblocks the rest from review.
Smaller subsequent PRs scoped to one concern each — the pass_servicelinks plumbing, individual protocol linkers in groups (e.g. messaging-bus protocols together, RPC protocols together), the MCP cross_project_links pagination + summary guard, and the storage-unification refactor. The current 24,000-line bundle isn't reviewable as a single unit; broken up, most of it likely is.
Worth a look first: pass_cross_repo.c match_typed_routes (Phase D) already handles CROSS_GRPC_CALLS / CROSS_GRAPHQL_CALLS / CROSS_TRPC_CALLS against any Route the pipeline emits, regardless of which pass produced it, and cross_project_links already exists in mcp.c. Some of the per-protocol work probably consolidates against that existing infrastructure rather than running parallel to it.

Closing for now. Re-opens are welcome under the structure above — the protocol-coverage idea itself is good.

Shidfar added 16 commits April 24, 2026 13:30

feat: add WebSocket, SSE, and tRPC protocol linkers

90f964c

feat: add HTTP servicelinker plumbing

a6b3090

feat: implement HTTP cross-project endpoint registration

a675725

feat: add HTTP-aware cross-repo matcher with ambiguity handling

d4c883a

test: add HTTP cross-project linker tests and fixtures

a9a5a40

fix: make S2 and S3 signals reachable in HTTP linker

e24970d

Shidfar mentioned this pull request Apr 28, 2026

Cross-project HTTP edges + unified storage + paginated cross_project_links hodizoda/codebase-memory-mcp#1

Merged

4 tasks

DeusData added the enhancement label May 4, 2026

DeusData closed this May 10, 2026

Shidfar wants to merge 16 commits into DeusData:main from hodizoda:claude/cross-project-http-edges-rebased
